Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting

Authors

  • Eva Pettersson
  • Beáta Megyesi
  • Joakim Nivre
Abstract

Natural language processing for historical text poses a variety of challenges, such as dealing with a high degree of spelling variation. Furthermore, there is often not enough linguistically annotated data available for training part-of-speech taggers and other tools aimed at handling this specific kind of text. In this paper we present a Levenshtein-based approach to normalising historical text to a modern spelling. This enables us to apply standard NLP tools trained on contemporary corpora to the normalised version of the historical input text. In its basic version, no annotated historical data is needed, since the only data used for the Levenshtein comparisons is a contemporary dictionary or corpus. In addition, a (small) corpus of manually normalised historical text can optionally be included to learn normalisations for frequent words and weights for edit operations in a supervised fashion, which improves precision. We show that this method is successful both in terms of normalisation accuracy and in the performance of a standard modern tagger applied to the historical text. We also compare our method to a previously implemented approach using a set of hand-written normalisation rules, and find that the Levenshtein-based approach clearly outperforms the hand-crafted rules. The experiments were carried out on Swedish data with promising results, and we believe that our method could be successfully applied to analysing historical text in other languages, including those with fewer resources.
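The abstract does not include the authors' implementation, but the core idea of matching a historical word form against a contemporary lexicon under a weighted Levenshtein distance can be illustrated with a short sketch. The edit-weight table, the toy lexicon, and the normalise helper below are assumptions made for this example, not the authors' code.

```python
# Illustrative sketch only (not the authors' implementation): normalise a
# historical word form by choosing the closest entry in a contemporary
# dictionary under a weighted Levenshtein distance.

def weighted_levenshtein(source, target, weights=None, default_cost=1.0):
    """Weighted edit distance. `weights` maps (char_from, char_to) pairs to
    costs; ('', c) encodes insertion of c and (c, '') deletion of c."""
    weights = weights or {}

    def cost(a, b):
        return weights.get((a, b), 0.0 if a == b else default_cost)

    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(source[i - 1], "")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost("", target[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + cost(source[i - 1], ""),                 # deletion
                d[i][j - 1] + cost("", target[j - 1]),                 # insertion
                d[i - 1][j - 1] + cost(source[i - 1], target[j - 1]),  # substitution / match
            )
    return d[m][n]


def normalise(historical_word, modern_lexicon, weights=None, max_dist=2.0):
    """Return the closest modern lexicon entry, or the word unchanged if no
    candidate lies within `max_dist`."""
    best, best_dist = historical_word, max_dist
    for candidate in modern_lexicon:
        dist = weighted_levenshtein(historical_word, candidate, weights)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


if __name__ == "__main__":
    # Toy example in the spirit of the Swedish data: 'hafva' is an older
    # spelling of modern 'hava'; deleting 'f' is given a reduced cost.
    lexicon = {"hava", "havet", "halva"}
    weights = {("f", ""): 0.5}
    print(normalise("hafva", lexicon, weights))  # -> hava
```

In practice the lexicon would be a full contemporary dictionary or corpus-derived word list, and the weights would either be uniform (the basic, resource-free setting) or estimated from a small manually normalised sample, as described in the abstract.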


Similar articles

Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations

While today's orthography is very strict and seldom changes, this has not always been true. In historical texts, the spelling of words often not only differs from today's but in some periods even varies from use to use within a single text. Information retrieval on historical corpora can deal with these variations using fuzzy matching techniques based on Levenshtein distance with stochastic weights. In pa...
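The abstract above is truncated, but the general idea of deriving edit-operation weights from observed spelling variants can be sketched as follows. Unlike the cited paper, this deliberately simplified version assumes a handful of aligned (historical, modern) pairs rather than learning without supervision, and it only handles equal-length, position-wise alignments; the pair list, smoothing, and cost formula are assumptions for illustration.

```python
# Deliberately simplified sketch (not the cited paper's unsupervised method):
# derive substitution weights from a few (historical, modern) variant pairs,
# so that frequently observed edits become cheap. Only equal-length pairs are
# aligned position by position; real systems align via a Levenshtein backtrace.
from collections import Counter

def estimate_weights(variant_pairs):
    sub_counts = Counter()   # how often character a was rewritten as b
    char_counts = Counter()  # how often character a was seen at all
    for old, new in variant_pairs:
        for a, b in zip(old, new):
            char_counts[a] += 1
            if a != b:
                sub_counts[(a, b)] += 1
    # Add-one smoothing keeps even very frequent edits from becoming free.
    return {
        (a, b): 1.0 - count / (char_counts[a] + 1)
        for (a, b), count in sub_counts.items()
    }

if __name__ == "__main__":
    # Toy Swedish-flavoured pairs (assumed for illustration).
    pairs = [("quinna", "kvinna"), ("qvarn", "kvarn")]
    print(estimate_weights(pairs))
    # -> {('q', 'k'): 0.33..., ('u', 'v'): 0.5}
```

Weights estimated this way can be plugged directly into a weighted Levenshtein matcher such as the one sketched after the main abstract above.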


Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach

Dictionary lookup methods are popular in dealing with ambiguous letters which were not recognized by Optical Character Readers. However, a robust dictionary lookup method can be complex, as a priori probability calculation or a large dictionary size increases the overhead and the cost of searching. In this context, Levenshtein distance is a simple metric which can be an effective string approxima...


Rule-based normalisation of historical text - A diachronic study

Language technology tools can be very useful for making information concealed in historical documents more easily accessible to historians, linguists and other researchers in humanities. For many languages, there is however a lack of linguistically annotated historical data that could be used for training NLP tools adapted to historical text. One way of avoiding the data sparseness problem in t...


Statistical Language Modeling for Historical Documents using Weighted Finite-State Transducers and Long Short-Term Memory

The goal of this work is to develop statistical natural language models and processing techniques based on Recurrent Neural Networks (RNN), especially the recently introduced Long Short-Term Memory (LSTM). Due to their adapting and predicting abilities, these methods are more robust, and easier to train, than traditional methods, i.e., word lists and rule-based models. They improve the output of ...


Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification

This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, which can be used for finding...



Journal:

Volume:    Issue:

Pages:

Publication date: 2013